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Bayesian curve fitting for lattice gauge theorists 

C. Morningstar*^ 

'^Physics Department, Carnegie Mellon University, Pittsburgh, PA, USA 

A new method of extracting the low-lying energy spectrum from Monte Carlo estimates of Euclidean-space 
correlation functions which incorporates Bayesian inference is described and tested. The procedure fully exploits 
the information present in the correlation functions at small temporal separations and uses this information in a 
way consistent with fundamental probabilistic hypotheses. The computed errors on the best-fit energies include 
both statistical uncertainties and systematic errors associated with the treatment of contamination from higher- 
lying stationary states. Difficulties in performing the integrals needed to compute these error estimates are briefly 
discussed. 


1. INTRODUCTION 

The physical consequences of a quantum field 
theory can be deduced from the correlation func¬ 
tions of the fields. Usually this is done by match¬ 
ing the correlation functions (or suitable combi¬ 
nations of them) to known behavior which con¬ 
tains the observables of interest. The path in¬ 
tegrals which yield the correlation functions are 
often computed using Monte Carlo estimates. 

To estimate the low-lying energy spectrum of 
the stationary states of a theory, one studies a 
matrix of correlation functions 

), ( 1 ) 

where { (j>i{t) ) = 0 and are operators capa¬ 
ble of creating the states of interest. As the tem¬ 
poral separation t becomes large, the elements of 
this correlation matrix tend (neglecting boundary 
conditions) to a handful of damped exponentials 
whose decay rates yield the lowest-lying energies 
of the system. To reliably determine these decay 
rates, the correlators must be calculated 

for large enough t such that they are well approx¬ 
imated by only small numbers of exponentials. 
The Monte Carlo estimates of the correlation ma¬ 
trix where now D denotes Monte Carlo es¬ 

timates, must then be fit to known asymptotic 
forms where M denotes the model fit¬ 

ting functions. The fit is carried out by adjusting 
the fit parameters, collectively denoted by u, in 
tiljit) in order to maximize the likelihood of find¬ 


ing the results /i^(t) given h^lf{t). This amounts 
to minimizing 

x2=^(m„(m)-D„) (^M0iu)-Dp), (2) 

0.(3 

where Mct{u) refers to and refers to 

The indices a and (3 each include the in¬ 
dices of the correlation matrix as well as the in¬ 
stants in time at which the correlation matrix is 
sampled. Call is the covariance matrix associ¬ 
ated with Da- One commonly-used method is to 
use the fitting function (which neglects boundary 
conditions) 

n—1 

h%{t) = Y, (3) 

p—0 

with > Ep^ and fit the nxn matrix for 

tmin < i < imax- If the n Operators have been con¬ 
structed using a variational approach applied to a 
much larger set of operators, the off-diagonal ma¬ 
trix elements are often statistically con¬ 

sistent with zero. In such cases, little is gained 
by including these correlation functions in the fit¬ 
ting, and hence, one may fit only to the n diagonal 
correlators using the fitting function 

n—1 

P—0 

where now the are usually real and positive. 

When using either Eq. (||) or (^ as the fitting 
function, it is crucial that tmin be chosen such 
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Figure 1. Effective mass plot showing fits for the 
lowest four energies of Example 1 using the model 
fitting function from Eq. and 7 < t < 14. The 
data points in this figure are effective masses as 
defined by Eq. (5.18) in Ref. 0 such that the low¬ 
est n effective masses tend to the lowest n physical 
masses as the temporal separation becomes large. 
Note that the correlation functions themselves are 
used in the fitting, not the effective mass points 
which are included only for illustrative purposes. 
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that contributions from all higher energy eigen¬ 
states are negligible. Generally one is guided by 
the quality of fit to determine this, but the guide¬ 
lines for doing so are not very clear. Eor exam¬ 
ple, one could start with tmin = 0, for which the 
fit quality is usually very bad, and increase fmin 
by one temporal lattice spacing at a time, keep- 

Icig tmax = tmin “f ^window for fixed t.\vindowj Until 

some acceptable fit quality, such as Q = 0.2, is 
achieved. But the acceptable value of Q and the 
value of twindow are somewhat subjective, and it 
is not clear how to incorporate the uncertainties 
in choosing these values into the final best-fit val¬ 
ues for the energy levels. More stringent values 
of Qacc and twindow generally lead to larger val¬ 
ues of fmin, and hence, larger errors in the best-fit 
energies. The subjectivity of these errors and the 


Table 1 

Estimates for the lowest-lying energies Eq and Ei 
from the standard analysis applied to Example 
1. The fits are determined for times t satisfying 
train < t < tmax and Q indicates the fit quality. 
The errors shown on Eq and Ei are determined 
using the bootstrap method. 


^min 

^max 

Q 

Eq 

El 

3 

10 

0 

0.130632(27) 

0.32435(10) 

4 

11 

0 

0.130514(24) 

0.32319(10) 

5 

12 

0 

0.130449(27) 

0.32268(14) 

6 

13 

0.23 

0.130421(33) 

0.32217(21) 

7 

14 

0.69 

0.130413(32) 

0.32210(26) 

8 

15 

0.21 

0.130404(31) 

0.32206(32) 


loss of information associated with the discarded 
t < train measurements of the correlators are un¬ 
desirable features of this approach. 

Eor illustrative purposes, consider two exam¬ 
ples, one with an excellent signal (Example 1), 
the other with a questionable signal (Example 2). 
Both examples are energies of a gauge field in the 
presence of a static positive and a static nega¬ 
tive charge in 2-|-1-dimensional compact U(l) lat¬ 
tice gauge theory using an improved action on 
an anisotropic lattice. Typical fit results for the 
lowest few energies using the standard analysis 
with the fitting function from Eq. (|0 are shown 
in Figures |I[0. Sg denotes the chosen symmetry 
channel and R is the number of lattice sites be¬ 
tween the charges. Tables ^ and ^ detail the fit 
results for the lowest-lying energies Eq and Ei in 
each example. In Example 1, a suitable choice for 
tinin and tniax Is easily guided by the fit quality 
Q and the final results are reasonably insensitive 
to minor changes in the fit range. In Example 
2, one might agonize over whether = 4 or 5 
should be used. The standard analysis provides 
no means of incorporating such an uncertainty 
into the error estimate. 

In order to incorporate information from small 
time separations, many exponentials must be re¬ 
tained in the model fitting functions. Standard 
fits using many exponentials quickly encounter 
severe problems: often, fitting instabilities occur 
and uncertainties in the estimates of the param¬ 
eters of interest become very large. The source 
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Figure 2. Effective mass plot showing fits for the 
lowest four energies of Example 2 using the model 
fitting function from Eq. (^) and 4 < t < 8. Other 
details in this figure are the same as in Fig. |^. 

of these problems is the fact that unconstrained 
fits allow physically insensible or impossible val¬ 
ues for those parameters which are not adequately 
constrained by the data. The solution to this 
problem is to introduce constraints. Bayesian 
statistics allows the introduction of such con¬ 
straints in a natural way. The Bayesian approach 
is now widely used throughout many disciplines, 
such as economics, medicine, astrophysics, and 
condensed matter physics. 

This talk is a report on progress being made 


Table 2 

Estimates for the lowest-lying energies Eq and Ei 
from the standard analysis applied to Example 2. 


^min 

^max 

Q 

Eq 

El 

2 

6 

0 

0.90102(78) 

0.98012(91) 

3 

7 

0.04 

0.8894(16) 

0.9685(21) 

4 

8 

0.17 

0.8847(35) 

0.9573(46) 

5 

9 

0.65 

0.8703(63) 

0.928(11) 

6 

10 

0.52 

0.832(18) 

0.895(27) 

7 

11 

0.64 

0.799(68) 

0.828(47) 


Figure 3. Effective masses with fits for 5 < t < 9. 
See Fig. ||for other details. 

in exploring methods of extracting physical ob¬ 
servables from Monte Carlo estimates of corre¬ 
lation functions which incorporate Bayesian in¬ 
ference into the statistical analysis. Such proce¬ 
dures can fully exploit the information present in 
the correlation functions at small temporal sepa¬ 
rations and use this information in a way consis¬ 
tent with fundamental probabilistic hypotheses. 
The computed errors on the best-fit parameters 
include both statistical uncertainties and system¬ 
atic errors associated with the treatment of the 
excited-state contaminations. Example 1 will be 
used to verify that such new methods work in 
cases where the signal is excellent. Example 2 
will then be used to see if the new methods pro¬ 
vide an improved means of extracting the infor¬ 
mation in cases when the signal is more obscured. 
An earlier report on such explorations is given in 
Ref. j|]. Bayesian methods have been previously 
discussed in the context of lattice QCD in Ref. [|| , 
and more recently, using maximum entropy meth¬ 
ods in Refs. [|-§. 

Since these methods employ Bayesian infer¬ 
ence, a brief review of the Bayesian approach is 
given in Sec. The application of the Bayesian 
approach to the particular problem of interest 
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here, namely, extraction of energies from Monte 
Carlo determinations of correlation functions, is 
then detailed in Sec. 

2. BAYESIAN REGRESSION 


since. 

The cornerstone of Bayesian statistics is Bayes’ 
theorem: 


p{M\Dr\i) 


p{D\Mni) P{M\I) 
JdM P{D\M np) P{M\I)’ 


Bayesian regression is different from the clas¬ 
sical (frequentist) approach. In the classical ap¬ 
proach, the data are the only source of informa¬ 
tion explicitly taken into account in constructing 
an estimate or test. In the Bayesian approach, 
an estimate or test is produced by combining 
the current data with information from past ex¬ 
perience and/or theoretical constraints (prior in¬ 
formation). In this section, Bayesian regression 
is described in general terms. Some introduc¬ 
tory textbooks on Bayesian statistics are given in 
Refs. For particular emphasis on Bayesian 

regression, see, for example. Refs. 00 , and for 
a few other discussions of Bayesian methods in 
physics, see Refs. mu- 

The Reverend Thomas Bayes was a Presbyte¬ 
rian minister born in 1702 in London, England, 
and who died April 17, 1761. His theory of prob¬ 
ability was published posthumously by Richard 
Price in 1763 in an article entitled an Essay to¬ 
wards solving a problem in the doctrine of chances 
in the Philosophical Transactions of the Royal 
Society. The theorem presented by Bayes was 
restricted to the binomial distribution, but its 
generality was most likely recognized by Bayes. 
Bayes’ theorem was generalized beyond the bino¬ 
mial distribution by Laplace in 1774, most likely 
without knowledge of Bayes’ work. Standard (fre¬ 
quentist) statistical methods were developed later 
than Bayesian methods. Although linear regres¬ 
sion and goodness of fit first appeared in the 
late 1800’s, the field blossomed during the 1920’s 
with the works of Fisher, Neyman, and Pearson, 
with a flurry of research and applications during 
World War 11. Bayesian methods are much older, 
but were largely ignored or actively opposed until 
the 1950’s when they were championed by promi¬ 
nent non-statisticians, such as physicist H. Jef¬ 
freys and economist A. Bowley. Their popularity 
grew rapidly during the 1970’s with the advent 
of affordable computers, and heated debates over 
the superiority of either method have raged ever 


where L> represents the data, M denotes the 
model which one believes should describe the 
data, and / represents our prior knowledge or 
background information about the system. The 
conditional probability distribution P(U|Mn/) is 
known as the likelihood of the data, P{M\I) is the 
prior probability distribution, and P{M\Dr]I) is 
called the posterior probability distribution. The 
term in the denominator of Eq. i) is independent 
of M and so can often simply be absorbed into an 
overall normalization. Bayes’ theorem essentially 
states that 

posterior oc likelihood x prior. 

Bayesian regression uses the posterior distribu¬ 
tion for all statistical inference. Model param¬ 
eters are estimated using one’s favorite statistic 
but evaluated using the posterior distribution. 
Common measures of central tendency include 
the mode, mean, and median, and common mea¬ 
sures of dispersion include the variance, skewness, 
and kurtosis. For example, if the model is speci¬ 
fied by a set of parameters u, then the mean value 
and variance of a model parameter Uj are given 

by 

K) (6) 

var{uj) = jdu — P{M{u)\D f] I). (7) 

The mode of Uj is found by maximizing the pos¬ 
terior probability, that is, maximizing the proba¬ 
bility of the model M being correct given data D 
and subject to the background information I. 

The likelihood P{D\M D I) is the probability 
distribution of the finding the data D given the 
particular set of model parameters M and back¬ 
ground information I, and from the Central Limit 
theorem has the form 

P(iJ|M n I) oc exp(—x^/2), (8) 

where is the familiar form given in Eq. (|^). 
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The new and key feature of Bayesian statistics 
is the prior P(M\I). In standard maximum like¬ 
lihood hts, the Bayesian prior P{M\I) is taken to 
be a constant and thus, maximizing the likelihood 
becomes equivalent to minimizing x^- When the 
number of fit parameters is small compared to the 
number of data points, this is a reasonable thing 
to do. However, when the number of fit parame¬ 
ters becomes comparable or larger than the num¬ 
ber of data points, the above procedure can be¬ 
come unstable and/or yield parameter estimates 
with overly large uncertainties, as mentioned pre¬ 
viously. The role of the Bayesian prior P{M\I) 
is to filter out parameter values that are impossi¬ 
ble and improbable, given some prior knowledge 
about the solutions, and hence, introduce con¬ 
straints into the hts. 

Prior distributions are specihed based on in¬ 
formation accumulated from past studies (ear¬ 
lier data), the opinions of subject-area experts, 
and/or from theoretical constraints. One often 
considers logical connections between the param¬ 
eters and symmetries which the prior distribution 
must obey. To simplify the computational bur¬ 
den, practitioners often restrict the choice of prior 
to some familiar distributional form. In some 
cases, the prior can be endowed with very little 
informative content (such as maximum entropy 
which amounts to letting a monkey throw balls 
into bins). Strictly speaking, using the observed 
data to inhuence the choice of prior is against the 
Bayesian philosophy. However, it is legitimate to 
use a small subset of the data to guide the con¬ 
struction of the prior and to use the rest of the 
data in the regression itself. It is very important 
that one should avoid putting in more informa¬ 
tion than is truly known, since results do and 
should depend on the prior. The prior is often 
viewed both as a nuisance and as an opportunity. 

3. CHOICE OF PRIOR AND TESTING 

By introducing a Bayesian prior, we can now 
use the model function of Eq. (^ but summing 
over many more exponentials: 

^exp 1 

^ exp(-t£;p), (9) 

p=0 


where N^xp is the number of exponentials, Ncot 
is the number of correlators to be simultaneously 
ht, and i = 0,..., Acor— 1- To ensure positivity of 
the coefficients and to order the energies, define 

A« = (&«)', E^ = Er._, + el (10) 

so that the actual parameters used in the ht- 
ting are Ua = {Eq, e„,In total, there are 
(Acor+l)Aexp fitting parameters. We then choose 
a Gaussian prior 

P{M\I) oc exp ^ . (11) 

In this work, we shall not include a prior for 
the first Ncov energies, so that the sum over 
a in Eq. ( 0 ) includes all parameters except 
Eq, ei, e 2 ,..., ewcor-i- Hence, there are NcorNexp 
terms in the summation over a. In using such a 
prior, we incorporate our expectation that the pa¬ 
rameter corresponding to the index a most likely 
has a value between rja — Ua and rja P oa- 

To simplify matters, we assume that the ener¬ 
gies of the excited-state contamination above the 
Acor-th level are most likely to be equally spaced. 
Also, due to the variational construction of our 
operators, we expect that correlator j is domi¬ 
nated by the Ej exponential. We further simplify 
matters by taking the same 77 ^ and Ua for all 
the . All other coefficients are expected to be 
small and are given the same expected range of 
values. To summarize, our prior is specihed using 

S , O; Cjj j — Acor . . . Aexp 1, 

r, a^bf, j = 0...Acor-l, (12) 

7, a^bP, i^j, 

and for the Gaussian widths, we similarly use 
(Te, CTr, and a^. Ghoosing values for Aexp, £, 
r, 7 , CTe, CTr, and completes the specihcation 
of our prior. Aexp is easily chosen by increasing 
its value until the energies of interest stabilize. 
The widths must be chosen sufficiently large so 
that the hts are not overly constrained beyond 
our knowledge, but overly large widths could lead 
to overly large uncertainties in the best-ht pa¬ 
rameters. Note that describing the prior is very 
straightforward, in contrast to describing how one 
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Figure 4. Effective masses and fits for the lowest 
four energies of Example 1 using the Bayesian 
analysis described in Sec. ^ and 0 < f < 18. Prior 
parameter values are A^exp = 34, e = 0.2, CTe = 
0.1, r = 0.9, err = 0.2, 7 = 0.05, = 0.05. 


Figure 5. Effective masses and fits for the lowest 
four energies of Example 2 using the Bayesian 
analysis described in Sec. ^ and 0 < t < 9. Prior 
parameter values are A"exp = 54, e = 0.1, Ue = 
0.05, F = 0.9, or = 0.2, 7 = 0.05, = 0.05. 


subjectively chooses fitting ranges tmin and tmax 
in the standard analysis. 

The results of applying this Bayesian curve fit¬ 
ting method in the two examples previously de¬ 
scribed are shown in Figs. ^ and The fit values 
shown correspond to the mode of each quantity, 
that is, the value of the quantity at the maxi¬ 
mum of the posterior distribution. The errors 
have been computed assuming a normal distribu¬ 
tion about this maximum for reasons discussed 
below. These errors automatically combine the 
statistical errors in the data with the systematic 
uncertainties in our prior knowledge. The method 
works remarkably well in both examples, elimi¬ 
nating the need to subjectively determine tmin- 
The insensitivity of the results to the parame¬ 
ters in the prior probability is demonstrated in 
Fig. I Notice that the error estimates increase 
significantly only in case (e) which is a drastic 
change in the prior. This insensitivity indicates 
that Eg and Ei are largely being determined by 
the data. Of course, the data contain little in¬ 
formation about many of the other parameters, 


and hence, such parameters depend strongly on 
the prior. We have found that this new method 
works well as long as e is not set much larger than 
the expected level spacings in the system and the 
allowed ranges of the coefficients are not set too 
large, as in case (e). 

Currently, there remains one fly in the oint¬ 
ment, namely, the computation of the error esti¬ 
mates. We have evaluated the multi-dimensional 
integrals needed to calculate the variances in the 
fit parameters using both the Metropolis method 
and HMC and have encountered severely long 
auto-correlations. The presence of many poorly 
constrained parameters leads to long, narrow 
ridges in the probability distribution which ham¬ 
per the Markov chain Monte Carlo integration. 
For the Metropolis method, we have commonly 
observed auto-correlation lengths in the tens of 
thousands. The problem is somewhat amelio¬ 
rated with HMC, reducing the auto-correlation 
lengths by factors of twenty or so, but clearly, 
this is still not sufficient. Work continues on this 
topic. An alternative method of computing the 
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Figure 6. Sensitivity of ifo and Ei to changes in 
the parameters of the Bayesian prior. For each 
point, the parameters in the prior are the same 
as stated in Figs. ^ and||, except for the following 
changes: 

(a) e = 0.1, (Je = 0.05, 

(b) £ = 0.2 , (Te = 0 .10, 

(c) £ = 0.3, (Te = 0.15, 

(d) F = 0.8, or = 0.3, 7 = 0.1, a.y = 0.1, 

(e) F = 0.7, or = 0.7, 7 = 0.7, cr^ = 0.7. 


errors (which deviates from a strict Bayesian phi¬ 
losophy) is suggested in Ref. dj. 


4. CONCLUSION 

Bayesian regression techniques are a viable al¬ 
ternative method for extracting physical observ¬ 
ables from stochastically-determined correlation 
functions. By allowing the regression to take into 
account prior knowledge of the system from theo¬ 
retical considerations and/or previous experience, 
Bayesian methods can make use of information 
obtained at short temporal separations and can 
incorporate systematic effects into error estimates 
of the observables. However, Bayesian methods 
are not a cure for bad data. 

Much of this work arose from a collabora¬ 
tive effort to explore Bayesian methods. Oth¬ 
ers involved in this effort are Peter Lepage, 
Bryan Clark, Christine Davies, Kent Hornbostel, 
Paul Mackenzie, and Howard Trottier. The au¬ 


thor wishes to acknowledge especially fruitful 
discussions with Peter Lepage, and also with 
Robert Swendsen, Julius Kuti, Jimmy Juge, 
David Richards, and Mike Peardon. This work 
was supported by the U.S. National Science Foun¬ 
dation under Award PHY-0099450. 
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